Policy Gradient Methods
Policy gradient methods are a family of reinforcement learning algorithms that directly optimize the policy without learning a value function as an intermediate step. They work by adjusting the policy parameters via gradient ascent on the expected cumulative reward.
Core Concept
Unlike value-based methods (such as Q-learning) that learn which actions are best in each state, policy gradient methods learn a parameterized policy that directly maps states to actions or action probabilities. The policy is optimized by following the gradient of expected return with respect to the policy parameters.
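To make this concrete, here is a minimal sketch (not from the original text) of one common parameterization: a linear softmax policy over state features. The feature dimension, action count, and function names are illustrative choices.

```python
import numpy as np

n_features, n_actions = 4, 2
theta = np.zeros((n_features, n_actions))  # policy parameters θ

def policy(state, theta):
    """Return π_θ(·|s): one probability per action, given state features."""
    logits = state @ theta            # a scalar preference for each action
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

def sample_action(state, theta):
    """Sample an action from π_θ(·|s)."""
    probs = policy(state, theta)
    return np.random.choice(len(probs), p=probs)
```

Because the parameters θ enter the action probabilities directly, improving the policy is a matter of nudging θ so that good actions become more likely, which is exactly what the gradient update below does.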
Mathematical Foundation
Let's denote:
- θ as the policy parameters
- πθ(a|s) as the probability of taking action a in state s under policy π with parameters θ
- τ = (s0, a0, r1, s1, a1, ...) as a trajectory
- R(τ) as the total reward of trajectory τ
- p(τ;θ) as the probability of trajectory τ under policy πθ
The objective is to maximize the expected return:

J(θ) = E_{τ∼p(τ;θ)}[R(τ)]

The policy gradient theorem gives us the gradient of this objective:

∇θ J(θ) = E_{τ∼p(τ;θ)}[∇θ log p(τ;θ) R(τ)]

Because the environment dynamics do not depend on θ, only the policy terms survive in ∇θ log p(τ;θ), and because an action cannot affect rewards received before it was taken, R(τ) can be replaced by the return that follows each action. This simplifies to:

∇θ J(θ) = E[Σt ∇θ log πθ(at|st) Qπ(st, at)]

where Qπ(s,a) is the expected return of taking action a in state s and then following policy π.
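As a quick numerical sanity check (a sketch with made-up reward values, not from the original text), the score-function form of the gradient can be verified on a two-armed bandit, where J(θ) has a closed form and ∇θ log πθ(a) = e_a − π for a softmax policy:

```python
import numpy as np

rng = np.random.default_rng(0)

rewards = np.array([1.0, 0.0])    # hypothetical deterministic rewards r(a)
theta = np.array([0.2, -0.1])     # one logit per arm

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

# Exact gradient of J(θ) = Σ_a π_θ(a) r(a) for a softmax policy:
# ∂J/∂θ_a = π(a) (r(a) - J)
probs = pi(theta)
J = probs @ rewards
exact_grad = probs * (rewards - J)

# Score-function estimate from the policy gradient theorem:
# ∇θ J ≈ average over sampled a of ∇θ log π_θ(a) · r(a), with ∇θ log π_θ(a) = e_a - π
samples = rng.choice(2, size=100_000, p=probs)
score = np.eye(2)[samples] - probs                      # ∇θ log π_θ(a) per sample
mc_grad = (score * rewards[samples][:, None]).mean(axis=0)

print("exact:       ", exact_grad)
print("monte carlo: ", mc_grad)   # agrees with the exact gradient up to sampling noise
```

The two estimates match, which is the practical content of the theorem: the gradient of expected return can be estimated purely from sampled actions and their rewards, without differentiating through the environment.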
REINFORCE Algorithm (Monte Carlo Policy Gradient)
The simplest policy gradient algorithm is REINFORCE, which uses the actual returns from complete episodes to update the policy; a minimal implementation sketch follows the steps below.
Algorithm Steps:
- Initialize policy parameters θ randomly
- For each episode:
  - Generate a trajectory τ by following policy πθ
  - For each step t in the trajectory:
    - Calculate the return Gt (the sum of rewards from step t onwards)
    - Update the policy parameters: θ ← θ + α ∇θ log πθ(at|st) Gt, where α is the learning rate
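Below is a minimal sketch of these steps in Python (not from the original text), assuming the PyTorch and Gymnasium libraries and the CartPole-v1 environment; the network architecture, learning rate, and episode count are illustrative choices.

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        # Sample a_t ~ π_θ(·|s_t) and record log π_θ(a_t|s_t)
        logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        obs, reward, terminated, truncated, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        done = terminated or truncated

    # Compute the return G_t (sum of rewards from step t onwards) for every step
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    # REINFORCE update: gradient ascent on Σ_t log π_θ(a_t|s_t) G_t,
    # implemented as gradient descent on the negated sum
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

env.close()
```

The update is applied once per episode because the returns Gt are only known after the episode finishes, which is why REINFORCE is called a Monte Carlo policy gradient method.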